Given a finite class of functions F, the problem of aggregation is to construct a procedure with a 
risk as close as possible to the risk of the best element in the class. A classical procedure (PAC- 
Bayesian statistical learning theory (2004) Paris 6, Statistical Learning Theory and Stochastic 
Optimization (2001) Springer, Ann. Statist. 28 (2000) 75-87) is the aggregate with exponential 
weights (AEW), defined by 

$$\tilde f^{AEW} = \sum_{f \in F} \theta(f) f, \quad \text{where } \theta(f) = \frac{\exp(-(n/T) R_n(f))}{\sum_{g \in F} \exp(-(n/T) R_n(g))},$$

where $T > 0$ is called the temperature parameter and $R_n(\cdot)$ is an empirical risk. 

In this article, we study the optimality of the AEW in the regression model with random 
design and in the low-temperature regime. We prove three properties of AEW. First, we show 
that AEW is a suboptimal aggregation procedure in expectation with respect to the quadratic 
risk when $T \le c_1$, where $c_1$ is an absolute positive constant (the low-temperature regime), and 
that it is suboptimal in probability even for high temperatures. Second, we show that as the 
cardinality of the dictionary grows, the behavior of AEW might deteriorate, namely, that in 
the low-temperature regime it might concentrate with high probability around elements in the 
dictionary with risk greater than the risk of the best function in the dictionary by at least an 
order of $1/\sqrt{n}$. Third, we prove that if a geometric condition on the dictionary (the so-called 
"Bernstein condition") is assumed, then AEW is indeed optimal both in high probability and in 
expectation in the low-temperature regime. Moreover, under that assumption, the complexity 
term is essentially the logarithm of the cardinality of the set of "almost minimizers" rather than 
the logarithm of the cardinality of the entire dictionary. This result holds for small values of the 
temperature parameter, thus complementing an analogous result for high temperatures. 
1. Introduction and main results 

In this note we study the problem concerning the optimality of the AEW in the regression 
model with random design. To formulate the problem, we need to introduce several 
definitions. 

Let $\mathcal{Z}$ and $\mathcal{X}$ be two measure spaces, and set $Z$ and $Z_1, \ldots, Z_n$ to be $n + 1$ i.i.d. random 
variables with values in $\mathcal{Z}$. From a statistical standpoint, $\mathcal{D} = (Z_1, \ldots, Z_n)$ is the set of 
given data at our disposal. The risk of a measurable real-valued function $f$ defined on 
$\mathcal{X}$ is given by 

$$R(f) = \mathbb{E} Q(Z, f),$$

where $Q : \mathcal{Z} \times \mathcal{L}(\mathcal{X}) \to \mathbb{R}$ is a non-negative function, called the loss function, and $\mathcal{L}(\mathcal{X})$ is 
the set of all real-valued measurable functions defined on $\mathcal{X}$. If $\hat f$ is a statistic constructed 
using the data $\mathcal{D}$, then the risk of $\hat f$ is the random variable 

$$R(\hat f) = \mathbb{E}[Q(Z, \hat f) \mid \mathcal{D}].$$

Throughout this article, we restrict our attention to functions $f$, loss functions $Q$, and 
random variables $Z$ for which $|Q(Z, f)| \le b$ almost surely. (Note that some results have 
been obtained in the same setup for unbounded loss functions in [7, 13, 32], and [4].) 
The loss function on which we focus throughout most of the article is the quadratic loss 
function, defined when $Z = (X, Y)$ by $Q((X, Y), f) = (Y - f(X))^2$. 

In the aggregation framework, one is given a finite set F of real-valued functions defined 
on X, usually called a dictionary. The problem of aggregation (see, e.g., [7, 10], and [31]) 
is to construct a procedure, usually called an aggregation procedure, that produces a 
function with a risk as close as possible to the risk of the best element in F. Keeping 
this in mind, one can define the optimal rate of aggregation [16, 26], which is the smallest 
price, as a function of the cardinality of the dictionary M and the sample size n, that 
one has to pay to construct a function with a risk as close as possible to that of the best 
element in the dictionary. We recall the definition for the "expectation case;" a similar 
definition for the "probability case" can be formulated as well (see, e.g., [16]). 

Definition 1.1 ([26]). Let $b > 0$. We say that $(\psi_n(M))_{n, M \in \mathbb{N}^*}$ is an optimal rate of 
aggregation in expectation when there exist two positive constants, $c_0$ and $c_1$, depending 
only on $b$, for which the following holds for any $n \in \mathbb{N}^*$ and $M \in \mathbb{N}^*$: 

1. There exists an aggregation procedure $\tilde f_n$ such that for any dictionary $F$ of cardinality 
$M$ and any random variable $Z$ satisfying $|Q(Z, f)| \le b$ almost surely for all 
$f \in F$, one has 

$$\mathbb{E} R(\tilde f_n) \le \min_{f \in F} R(f) + c_0 \psi_n(M); \qquad (1.1)$$

2. For any aggregation procedure $\bar f_n$, there exists a dictionary $F$ of cardinality $M$ and 
a random variable $Z$ such that $|Q(Z, f)| \le b$ almost surely for all $f \in F$ and 

$$\mathbb{E} R(\bar f_n) \ge \min_{f \in F} R(f) + c_1 \psi_n(M).$$
In our setup, one can show (cf. [26]) that in general, an optimal rate of aggregation 
(in the sense of [26] [optimality in expectation] and of [16] [optimality in probability]) 
is lower-bounded by $(\log M)/n$. Thus, procedures satisfying an exact oracle inequality 
like (1.1) (that is, an oracle inequality with a factor of 1 in front of $\min_{f \in F} R(f)$) with 
a residual term of $\psi_n(M) = (\log M)/n$ are said to be optimal. Only a few aggregation 
procedures have been shown to achieve this optimal rate, including the exponential 
aggregating schemes of [2, 3, 7, 13, 31], the "empirical star algorithm" in [3], and the 
"preselection/convexification algorithm" in [16]. For a survey on optimal aggregation 
procedures, see the HDR dissertation of J.-Y. Audibert. 

Our main focus here is on the problem of the optimality of the aggregation procedure 
with exponential weights (AEW). This procedure originates from the thermodynamic 
standpoint of learning theory (see [8] for the state of the art in this direction). AEW can 
be viewed as a relaxed version of the trivial aggregation scheme, which is to minimize 
the empirical risk 

$$R_n(f) = \frac{1}{n} \sum_{i=1}^{n} Q(Z_i, f) \qquad (1.2)$$

in the dictionary F. 

A procedure that minimizes (1.2) is called empirical risk minimization (ERM). It is 
well known that ERM generally cannot achieve the optimal rate of $(\log M)/n$, unless one 
assumes that the given class F has certain geometric properties, which we discuss below 
(see also [13, 18, 21]). To have any chance of obtaining better rates, one has to consider 
aggregation procedures that take values in larger sets than F. The most natural set is 
the convex hull of F. AEW is a very popular candidate for the optimal procedure, and it 
was one of the first procedures to be studied in the context of the aggregation framework 
[2, 4, 7, 9, 13, 15, 20, 31]. It is defined by the following convex sum: 



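$$\tilde f^{AEW} = \sum_{j=1}^{M} \theta_j f_j, \quad \text{where } \theta_j = \frac{\exp(-(n/T) R_n(f_j))}{\sum_{k=1}^{M} \exp(-(n/T) R_n(f_k))}, \qquad (1.3)$$

for the dictionary $F = \{f_1, \ldots, f_M\}$. The parameter $T > 0$ is called the temperature. 1 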
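The weights in (1.3) can be illustrated with a minimal Python sketch. The function names and the numerical risks below are hypothetical and chosen only for illustration; the shift by the largest score is a standard numerical safeguard, not part of the definition.

```python
import math

def aew_weights(empirical_risks, n, T):
    """Gibbs weights theta_j proportional to exp(-(n/T) * R_n(f_j)), as in (1.3)."""
    if T <= 0:
        raise ValueError("temperature T must be positive")
    scores = [-(n / T) * r for r in empirical_risks]
    shift = max(scores)  # subtract the largest score so exp() cannot overflow
    unnorm = [math.exp(s - shift) for s in scores]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def aew_predict(weights, predictions):
    """Value at one point x of the convex combination sum_j theta_j * f_j(x)."""
    return sum(w * p for w, p in zip(weights, predictions))

# Hypothetical empirical risks R_n(f_j) for a dictionary of three functions.
risks = [0.30, 0.31, 0.50]
low = aew_weights(risks, n=100, T=0.01)    # low temperature: close to ERM
high = aew_weights(risks, n=100, T=100.0)  # high temperature: close to uniform
```

At low temperature almost all of the mass sits on the empirical risk minimizer, which is the sense in which AEW relaxes ERM; at high temperature the weights flatten toward the uniform distribution on F.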

Thus far, there have been three main results concerning the optimality of the AEW. 
The first of these is that the progressive mixture rule is optimal in expectation for T 
larger than some parameters of the model (see [4, 7, 13, 30, 32] and [3]), and under 
a certain convexity assumption on the loss function Q. This procedure is defined by 

$$\bar f = \frac{1}{n} \sum_{k=1}^{n} \tilde f_k^{AEW}, \qquad (1.4)$$

where $\tilde f_k^{AEW}$ is the function generated by AEW (with a common temperature parameter $T$) 
associated with the dictionary $F$ and constructed using only the first $k$ observations 
$Z_1, \ldots, Z_k$. (See [3] for more details and for other procedures related to the progressive 
mixture rule.) 

1 This terminology comes from thermodynamics, since the weights $(\theta_1, \ldots, \theta_M)$ can be seen as a Gibbs 
measure with temperature $T$ on the dictionary $F$. 
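The progressive mixture rule (1.4) averages the AEW weights computed from the first k observations, for k = 1, ..., n. The following Python sketch illustrates this under the assumption that the losses Q(Z_i, f_j) are already available as a table; the function names and the toy loss table are hypothetical.

```python
import math

def gibbs_weights(cumulative_losses, T):
    """AEW weights from cumulative losses: proportional to exp(-sum_i Q(Z_i, f_j) / T)."""
    scores = [-c / T for c in cumulative_losses]
    shift = max(scores)  # numerical safeguard against overflow
    unnorm = [math.exp(s - shift) for s in scores]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def progressive_mixture_weights(losses, T):
    """losses[i][j] holds Q(Z_{i+1}, f_j); returns the weights of (1/n) * sum_k f_k^AEW."""
    n, M = len(losses), len(losses[0])
    cumulative = [0.0] * M
    average = [0.0] * M
    for row in losses:
        # With k observations, (k/T) * R_k(f_j) equals the cumulative loss divided by T.
        cumulative = [c + q for c, q in zip(cumulative, row)]
        step = gibbs_weights(cumulative, T)
        average = [a + w / n for a, w in zip(average, step)]
    return average

# Toy table: f_1 always has loss 0, f_2 always has loss 1 (hypothetical values).
losses = [[0.0, 1.0], [0.0, 1.0], [0.0, 1.0]]
weights = progressive_mixture_weights(losses, T=1.0)
```

Because every intermediate weight vector is a probability vector, their average is again a convex combination of the dictionary.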

Second, the optimality in expectation of AEW was obtained by [9] for the regression 
model $Y_i = f(x_i) + \varepsilon_i$ with a deterministic design $x_1, \ldots, x_n \in \mathcal{X}$ with respect 
to the risk $\|g - f\|_n^2 = n^{-1} \sum_{i=1}^{n} (g(x_i) - f(x_i))^2$ (with its empirical version being 
$R_n(g) = n^{-1} \sum_{i=1}^{n} (Y_i - g(x_i))^2$). That is, it was shown that for $T \ge c \max(b, \sigma^2)$, where 
$\sigma^2$ is the variance of the noise $\varepsilon$, 

$$\mathbb{E} \|\tilde f^{AEW} - f\|_n^2 \le \min_{g \in F} \|g - f\|_n^2 + \frac{T \log M}{n}. \qquad (1.5)$$

Finally, [1, 2], and [8] proved that in the high-temperature regime, AEW can achieve the 
optimal rate $(\log M)/n$ under the Bernstein assumption, recalled below in Definition 1.3, 
in expectation and in high probability. This result is discussed in more detail later. 

Despite the long history of AEW, the literature contains no results on the optimality (or 
suboptimality) of AEW in the regression model with random design in the general case 
(when the dictionary does not necessarily satisfy the Bernstein condition). In this article, 
we address this issue and complement the results (assuming the Bernstein condition) 
of [1, 2, 8] for the low-temperature regime by proving the following: 

- AEW is suboptimal for low temperatures $T \le c_1$ (where $c_1$ is an absolute positive 
constant), both in expectation and in probability, for the quadratic loss function 
and a dictionary of cardinality 2 (Theorem A). 

- AEW is suboptimal in probability for some large dictionaries (of cardinality $M \sim \sqrt{n \log n}$) 
and small temperatures $T \le c_1$ (Theorem B). 

- AEW achieves the optimal rate $(\log M)/n$ for low temperatures under the Bernstein 
condition on the dictionary (Theorem C). Together with the high-temperature 
results of [1, 2] and [8], this proves that the temperature parameter has almost no 
impact (as long as $T = O(1)$) on the performance of the AEW under this condition, 
with a residual term of the order of $((T + 1) \log M)/n$ for every $T > 0$. 

Theorem A. There exist absolute constants $c_0, \ldots, c_5$ for which the following holds. For 
any integer $n \ge c_0$, there are random variables $(X, Y)$ and a dictionary $F = \{f_1, f_2\}$ such 
that $(Y - f_i(X))^2 \le 1$ almost surely for $i = 1, 2$, for which the quadratic risk of the AEW 
satisfies the following: 

1. if $T \le c_1$ and $n$ is odd, then 

$$\mathbb{E} R(\tilde f^{AEW}) \ge \min_{f \in F} R(f) + \frac{c_2}{\sqrt{n}};$$

2. if $T \le c_3 \sqrt{n} / \log n$, then, with probability greater than $c_4$, 

$$R(\tilde f^{AEW}) \ge \min_{f \in F} R(f) + \frac{c_5}{\sqrt{n}}.$$


Theorem A proves that AEW is suboptimal in expectation in the low-temperature 
regime and suboptimal in probability in both the low- and high-temperature regimes, 
since it is possible to construct procedures that achieve the rate $C/n$ with high probability 
[3, 16] and in expectation [3, 4, 7, 13, 30, 32] in the same setup as for Theorem A. It 
should be noted that the problem of the optimality in probability of the progressive 
mixture rule (and other related procedures) was studied by [3], who proved that, for a 
loss function $Q$ satisfying some convexity and regularity assumption (e.g., the quadratic 
loss used in Theorem A), the progressive mixture rule $\bar f$ defined in (1.4) satisfies that for 
any temperature parameter, with probability greater than an absolute constant $c_0 > 0$, 

$$R(\bar f) \ge \min_{f \in F} R(f) + \frac{C}{\sqrt{n}}.$$

In addition, it is important to observe that the suboptimality in probability does 
not imply suboptimality in expectation for the aggregation problem, or vice versa. This 
property of the aggregation problem was first noted by [3], who found the progressive 
mixture rule (and other related aggregation procedures) to be suboptimal in probability 
for dictionaries of cardinality two but, on the other hand, to be optimal in expectation 
([7, 30, 32] and [13]). This peculiar property of the problem of aggregation comes from the 
fact that an aggregate $\tilde f$ is not restricted to the set $F$, which allows $R(\tilde f) - \min_{f \in F} R(f)$ 
to take negative values. [3] showed that for the progressive mixture rule $\bar f$, these negative 
values do compensate on average for larger values, but there is still an event of constant 
probability on which $R(\bar f) - \min_{f \in F} R(f)$ takes values greater than $C/\sqrt{n}$. 

The proof of Theorem A shows that a dictionary consisting of two functions is sufficient 
to yield a lower bound in expectation in the low-temperature regime and in probability 
in both the small temperature regime, $0 < T \le c_1$, and the large temperature regime, 
$c_1 \le T \le c_3 \sqrt{n} / \log n$. In the following theorem, we study the behavior of AEW for larger 
dictionaries. To the best of our knowledge, negative results on the behavior of exponential 
weights based aggregation procedures are not known for dictionaries with more than two 
functions, and we show that the behavior of the AEW deteriorates in some sense as the 
cardinality of the dictionary increases. 

Theorem B. There exist an integer $n_0$ and absolute constants $c_1$ and $c_2$ for which the 
following holds. For every $n \ge n_0$, there are random variables $(X, Y)$ and a dictionary 
$F = \{f_1, \ldots, f_M\}$ of cardinality $M = \lceil c_1 \sqrt{n \log n} \rceil$, for which the quadratic loss function 
of any element in $F$ is bounded by 2 almost surely, and for every $0 < \alpha < 1/2$, if $T \le c_2 \alpha$, 
then with probability at least $1 - c_3(\alpha) n^{\alpha - 1/2}$, 

$$R(\tilde f^{AEW}) \ge \min_{f \in F} R(f) + c_4(\alpha) \sqrt{\frac{\log M}{n}}.$$

Moreover, if $f_F^* \in F$ denotes the optimal function in $F$ with respect to the quadratic loss 
(the oracle), then there exists $f_j \ne f_F^*$ with an excess risk greater than $c_5(\alpha) n^{-1/2}$ and 
for which the weight of $f_j$ in the AEW procedure satisfies $\theta_j \ge 1 - n^{-c_6(\alpha)/T}$. 

Theorem B implies that the AEW procedure might cause the weights to concentrate 
around a "bad" element in the dictionary (i.e., an element whose risk is larger than the 
best in the class by at least $\sim n^{-1/2}$) with high probability. In particular, Theorem B 
provides additional evidence that the AEW procedure is suboptimal for low temperatures. 
The analysis of the behavior of AEW for a dictionary of cardinality larger than two 
is considerably harder than in the two-function case and requires some results on rear- 
rangement of independent random variables that are almost Gaussian (see Proposition 5.2 
below). Fortunately, not all is lost as far as optimality results for AEW go. Indeed, we 
show that under some geometric condition, AEW can be optimal and in fact can even 
adapt to the "real complexity" of the dictionary. 

Intuitively, a good aggregation scheme should be able to ignore the elements in the 
dictionary whose risk is far from the optimal risk in F, or at least the impact of such 
elements on the function produced by the aggregation procedure should be small. Thus, 
a good procedure is one with a residual term of the order of $\psi/n$, where $\psi$ is a complexity 
measure that is determined only by the richness of the set of "almost minimizers" in the 
dictionary. This leads to the following question: 

Question 1.2. Is it possible to construct an aggregation procedure that adapts to the 
real complexity of the dictionary? 

This question was first addressed by the PAC-Bayesian approach. [1, 2] and [8] showed 
that in the high-temperature regime, AEW satisfies the requirements of Question 1.2, 
assuming that the class has a geometric property, called the Bernstein condition. 

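Definition 1.3 ([5]). We say that a function class $\mathcal{F}$ is a $(\beta, B)$-Bernstein class ($0 < 
\beta \le 1$ and $B \ge 1$) with respect to $Z$ if every $f \in \mathcal{F}$ satisfies $\mathbb{E} f \ge 0$ and 

$$\mathbb{E}(f^2(Z)) \le B (\mathbb{E} f(Z))^{\beta}. \qquad (1.6)$$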

There are many natural situations in which the Bernstein condition is satisfied. For 
instance, when $Q$ is the quadratic loss function and the regression function is assumed to 
belong to $F$, the excess loss function class $\mathcal{L}_F = \{Q(\cdot, f) - Q(\cdot, f_F^*) : f \in F\}$ satisfies the 
Bernstein condition with $\beta = 1$, where $f_F^* \in F$ is the minimizer of the risk in the class $F$. 
Another generic example is when the target function $Y$ is far from the set of targets with 
"multiple minimizers" in $F$ and $\mathcal{L}_F$ satisfies the Bernstein condition with $\beta = 1$. (See 
[21, 22] for an exact formulation of this statement and related results.) 

The Bernstein condition is very natural in the context of ERM because it has two 
consequences: that the empirical excess risk has better concentration properties around 
the excess risk, and that the complexity of the subset of F consisting of almost minimizers 
is smaller under this assumption. Consequently, if the class $\mathcal{L}_F$ is a $(\beta, B)$-Bernstein class 
for $0 < \beta \le 1$, then the ERM algorithm can achieve fast rates (see, e.g., [5] and references 
therein). As the results below show, the same is true for AEW. Indeed, under a Bernstein 
assumption, [1, 2] and [8] proved that if $R(\cdot)$ is a convex risk function and if $F$ is such that 
$|Q(Z, f)| \le b$ almost surely for any $f \in F$, then for every $T \ge c_1 \max\{b, B\}$ and $x > 0$, 
with probability greater than $1 - 2\exp(-x)$, 

$$R(\tilde f^{AEW}) \le \min_{f \in F} R(f) + \frac{T}{n} \Big( x + \log \Big( \sum_{f \in F} \exp(-(n/2T)(R(f) - R(f_F^*))) \Big) \Big). \qquad (1.7)$$
Although the PAC-Bayesian approach cannot be used to obtain (1.7) in the 
low-temperature regime ($T \le c_1 \max\{b, B\}$), such a result is not surprising. Indeed, because 
fast error rates for the ERM are expected when the underlying excess loss functions class 
satisfies the Bernstein condition, and because AEW converges to the ERM when the 
temperature T tends to 0, it is likely that for "small values" of T, AEW inherits some 
of the properties of ERM, such as fast rates under a Bernstein condition. We show this 
in Theorem C, proving that AEW answers Question 1.2 for low temperatures under the 
Bernstein condition. 

Before formulating Theorem C, we introduce the following measure of complexity. For 
every $r > 0$, let 

$$\psi(r) = \log(|\{f \in F : R(f) - R(f_F^*) \le r\}| + 1) + \sum_{j=1}^{\infty} 2^{-j} \log(|\{f \in F : 2^{j-1} r < R(f) - R(f_F^*) \le 2^j r\}| + 1),$$

where $|A|$ denotes the cardinality of the set $A$. 

Observe that $\psi(r)$ is a weighted sum of the number of elements in $F$ that assigns 
smaller and smaller weights to functions with a relatively large excess risk. 
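The complexity measure can be evaluated directly from the list of excess risks of a finite dictionary. The Python sketch below is only an illustration: the function name `psi` and the example excess risks are hypothetical, and the infinite sum is truncated at the first dyadic shell that lies beyond the largest excess risk, since all later shells are empty and contribute log(1) = 0.

```python
import math

def psi(excess_risks, r):
    """Complexity psi(r) of a finite dictionary, given the excess risks R(f) - R(f_F^*)."""
    if r <= 0:
        raise ValueError("r must be positive")
    # Elements with excess risk at most r: the "almost minimizers".
    total = math.log(sum(1 for e in excess_risks if e <= r) + 1)
    top = max(excess_risks)
    j = 1
    while 2 ** (j - 1) * r < top:  # later shells are empty and add log(1) = 0
        lo, hi = 2 ** (j - 1) * r, 2 ** j * r
        count = sum(1 for e in excess_risks if lo < e <= hi)
        total += 2.0 ** (-j) * math.log(count + 1)
        j += 1
    return total

# Hypothetical excess risks of a dictionary with four elements.
excess = [0.0, 0.05, 0.15, 0.35]
value = psi(excess, r=0.1)
```

Since the shell weights sum to at most 1 and each shell has at most |F| elements, this quantity never exceeds 2 log(|F| + 1).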



Theorem C. There exist absolute constants $c_0$, $c_1$, $c_2$, and $c_3$ for which the following 
holds. Let $F$ be a class of functions bounded by $b$ such that the excess loss class $\mathcal{L}_F$ 
is a $(1, B)$-Bernstein class with respect to $Z$. If the risk function $R(\cdot)$ is convex and 
if $T \le c_0 \max\{b, B\}$, then for every $x > 0$, with probability at least $1 - 2\exp(-x)$, the 
function $\tilde f^{AEW}$ produced by the AEW algorithm satisfies 

$$R(\tilde f^{AEW}) \le R(f_F^*) + c_1 (b + B) \frac{x + \psi(\theta)}{n},$$

where $\theta = c_2 (b + B)(\log |F|)/n$. 
In particular, 

$$\mathbb{E} R(\tilde f^{AEW}) \le R(f_F^*) + c_3 (b + B) \frac{\psi(\theta)}{n}.$$

In other words, the scaling factor $\theta$ that we use is proportional to $(b + B)(\log |F|)/n$, 
and if the class is regular (in the sense that the complexity of $F$ is well spread and not 
concentrated just around one point), then $\psi(\theta)$ is roughly the logarithm of the number of 
elements in $F$ with excess risk at most $\sim (b + B)(\log |F|)/n$. 

Observe that for every $r > 0$, $\psi(r) \le c \log |F|$ for a suitable absolute constant $c$. Thus, 
if $T$ is reasonably small (below a level proportional to $\max\{B, b\}$), then the resulting 
aggregation rate is the optimal one, proportional to $(b + B)(x + \log M)/n$ with probability 
$1 - 2\exp(-x)$, and proportional to $(b + B)(\log M)/n$ in expectation. Thus, Theorem C 
indeed gives a positive answer to Question 1.2 in the presence of a Bernstein condition 
and for low temperatures. 
Although the residual terms in Theorem C and in (1.7) are not the same, they are 
comparable. Indeed, the contribution of each element in F in the residual term depends 
exponentially on its excess risk. 

Theorem C together with the results for high temperatures from [1, 2] and [8] show 
that the AEW is an optimal aggregation procedure under the Bernstein condition as long 
as $T = O(1)$ when $M$ and $n$ tend to infinity. In general, the residual term obtained is on 
the order of $((T + 1) \log M)/n$, and it can be proven that the optimal rate of aggregation 
under the Bernstein condition is proportional to $(\log M)/n$ using the classical tools in [28]. 

Finally, a word about the organization of the article. In the next section we present 
some comments about our results. The proofs of the three theorems follow in the 
subsequent sections. Throughout, we denote absolute constants or constants that depend on 
other parameters by $c_1$, $c_2$, etc. (Of course, we specify when a constant is absolute and 
when it depends on other parameters.) The values of constants may change from line to 
line. We write $a \sim b$ if there are absolute constants $c$ and $C$ such that $cb \le a \le Cb$, and 
write $a \lesssim b$ if $a \le Cb$. 



2. Comments 

Although from a theoretical standpoint, whether AEW is an optimal procedure in ex- 
pectation and for high temperatures in the regression model with random design remains 
to be seen, from a practical standpoint, we believe that exponential aggregating schemes 
simply should not be used in the setup of this article, because of the following reasons 
(see also the comments in [3]): 

1. For any temperature $T \le c_0 \sqrt{n} / \log n$, there is an event of constant probability on 
which AEW performs poorly (this is the second part of Theorem A). 

2. If the temperature parameter is chosen to be too small, then the AEW can perform 
poorly even in expectation (the first part of Theorem A). 

Another consequence of the lower bounds stated in Theorem A is that AEW cannot be 
an optimal aggregation procedure both in expectation and in probability at low temper- 
atures for two other aggregation problems: the problem of convex aggregation, in which 
one wants to mimic the best element in the convex hull of F, and the problem of linear 
aggregation, where one wishes to mimic the best linear combination of elements in F. 
Indeed, clearly 

$$\min_{f \in F} R(f) \ge \min_{f \in \mathrm{conv}(F)} R(f) \ge \min_{f \in \mathrm{span}(F)} R(f).$$

Moreover, the optimal rates of aggregation for the convex and linear aggregation 
problems for dictionaries of cardinality two are of the order of $n^{-1}$ (see [14, 17, 26]), whereas 
the residual terms obtained in Theorem A are on the order of $n^{-1/2}$ for such a 
dictionary. Thus AEW is suboptimal for these two other aggregation problems in the 
low-temperature regime. 

We end this section by comparing two seemingly related assumptions, the margin 
assumption of [27] and the Bernstein condition of [5] . Note that in the proof of Theorem C, 
we have restricted ourselves to the case $\beta = 1$ simply to make the presentation as simple 
as possible. A very similar result, with the residual term $((x + \psi(\theta))/n)^{1/(2-\beta)}$ for the 
exact oracle inequality in probability and $(\psi(\theta)/n)^{1/(2-\beta)}$ for the exact oracle inequality 
in expectation, holds if one assumes a Bernstein condition for any $0 < \beta \le 1$, and the proof 
is identical to that in the case where $\beta = 1$. This makes the discussion about $\beta$-Bernstein 
classes relevant here. 

Recall the definition of the margin assumption: 

Definition 2.1 ([27]). We say that $F$ has margin with parameters $(\beta, B)$ ($0 < \beta \le 1$ 
and $B \ge 1$) if for every $f \in F$, 

$$\mathbb{E}[(Q(Z, f) - Q(Z, f^*))^2] \le B (R(f) - R(f^*))^{\beta},$$

where $f^*$ is defined such that $R(f^*) = \min_f R(f)$, and the minimum is taken with respect 
to all measurable functions $f$ on the given probability space. 

Although the margin condition appears similar to the Bernstein condition, they are in 
fact very different, and have been introduced in the context of different types of problems. 
In the first of these, the "classical" statistical setup, one is given a function class $F$ (the 
model) with an upper bound on its complexity and an unknown target function $f^*$, the 
minimizer of the risk over all measurable functions. One usually assumes that $f^*$ belongs 
to $F$, and the aim is to construct an estimator $\hat f = \hat f(\cdot, \mathcal{D})$ for which the excess risk 
$R(\hat f) - R(f^*)$ tends to 0 quickly as the sample size tends to infinity. In this setup, the margin 
assumption can improve this rate of convergence because of a better concentration of empirical 
means of $Q(\cdot, f) - Q(\cdot, f^*)$ around its mean [27]. The margin assumption (MA) for $\beta = 1$ compares 
the performance of each $f \in F$ with the best possible measurable function, but it has 
nothing to do with the geometric structure of $F$. The margin is determined for every $f$ 
separately, because $f^*$ does not depend on the choice of $F$. 

In the second type of problem, the "learning theory" setup, one does not assume that 
the target function $f^*$ belongs to $F$. The aim is to construct a function $\hat f$ with a risk 
as close as possible to that of the best element $f_F^* \in F$. Assuming that the excess loss 
class $\mathcal{L}_F$ satisfies the Bernstein condition (BC), the error rate can be improved (see, 
e.g., [5, 22]). 

At first glance, MA and BC (for $\beta = 1$) share very strong similarities. Indeed, saying 
that $\mathcal{L}_F$ is a $(1, B)$-Bernstein class means that for every $f \in F$, 

$$\mathbb{E}((Q(Z, f) - Q(Z, f_F^*))^2) \le B (R(f) - R(f_F^*)),$$

but nevertheless they are different. Indeed, as mentioned earlier, MA is only a matter 
of concentration (and classical statistics questions are mostly a question of the trade-off 
between concentration and complexity). On the other hand, BC involves a lot of geometry 
of the function class $F$, because $f_F^*$ might change significantly by adding a single function 
to F or by removing a function. In fact, the difficulty of learning theory problems is 
determined by the trade-off between concentration and complexity, and the geometry of 
the given class, since one measures the performance of the learning algorithm relative 
to the best in the class. Assuming that $f^* \in F$, as is usually done in classical statistics, 
exempts one from the need to consider the geometry of $F$, but one does not have that 
freedom in the aggregation framework. Indeed, since in the AEW algorithm the estimator 
is determined by the empirical means $R_n(f) - R_n(f_F^*)$, this is a learning problem rather 
than a problem in classical statistics, despite the fact that it has been used in statistical 
frameworks to construct adaptive estimators (see, e.g., [2, 4, 6, 11, 15, 20, 25, 27, 31]). 
Therefore, given their nature, aggregation procedures like the AEW are more natural 
under a BC assumption than under the MA. (A by-product of Theorem A is that the 
MA cannot improve the performance of AEW since in the setup of Theorem A, it is easy 
to check that MA is satisfied with the best possible margin parameter $\beta = 1$.) 

3. Preliminary results on Gaussian approximation 

Our starting point is the Berry-Esseen theorem on Gaussian approximation. Let $(W_n)_{n \in \mathbb{N}}$ 
be a sequence of i.i.d., mean-0 random variables with variance 1, set $g$ to be a standard 
Gaussian variable, and write 

$$X_n = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} W_i.$$

Theorem 3.1 ([23]). There exists an absolute constant $A > 0$ such that for every 
integer $n$, 

$$\sup_{x \in \mathbb{R}} |\mathbb{P}[X_n \le x] - \mathbb{P}[g \le x]| \le \frac{A \mathbb{E}|W_1|^3}{\sqrt{n}}.$$

From here on, we let $A$ denote the constant appearing in Theorem 3.1. 

When the tail behavior of the $W_i$ has a subexponential decay, the Gaussian 
approximation can be improved. Indeed, recall that a real-valued random variable $W$ belongs to 
$L_{\psi_\alpha}$ for some $\alpha \ge 1$ if there exists $0 < c < \infty$ such that 

$$\mathbb{E} \exp(|W|^{\alpha} / c^{\alpha}) \le 2. \qquad (3.1)$$

The infimum over all constants $c$ for which (3.1) holds defines an Orlicz norm, which is 
called the $\psi_\alpha$ norm and is denoted by $\|\cdot\|_{\psi_\alpha}$. (For more information on Orlicz norms, 
see, e.g., [29] and [24].) 

Proposition 3.2 (Chapter 5 in [23]). For every $L > 0$, there exist constants $B_0$, $c_1$, 
and $c_2$ that depend only on $L$ for which the following holds. If $\|W\|_{\psi_1} \le L$, then for any 
$x \ge 0$ such that $x \le B_0 n^{1/6}$, 

$$\mathbb{P}[X_n \ge x] = \mathbb{P}[g \ge x] \exp\Big( \frac{x^3 \mathbb{E} W^3}{6 \sqrt{n}} \Big) \Big[ 1 + O\Big( \frac{x + 1}{\sqrt{n}} \Big) \Big]$$

and 

$$\mathbb{P}[X_n \le -x] = \mathbb{P}[g \le -x] \exp\Big( -\frac{x^3 \mathbb{E} W^3}{6 \sqrt{n}} \Big) \Big[ 1 + O\Big( \frac{x + 1}{\sqrt{n}} \Big) \Big],$$

where by $v = O(u)$ we mean that $-c_1 u \le v \le c_1 u$. 
In particular, if $|x| \le B_0 n^{1/6}$ and $\mathbb{E} W^3 = 0$, then 

$$|\mathbb{P}[X_n \le x] - \mathbb{P}[g \le x]| \le c_2 n^{-1/2} \exp(-x^2/2).$$

From here on, we let $B_0$ denote the constant appearing in Proposition 3.2. 
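The $n^{-1/2}$ rate in the Berry-Esseen theorem can be observed numerically in the simplest case. The Python sketch below assumes that the $W_i$ are fair random signs (an illustrative choice, not one made in the text) and computes the exact Kolmogorov distance between the normalized sum and the standard Gaussian; the function names are hypothetical.

```python
import math

def gaussian_cdf(x):
    """Distribution function of a standard Gaussian variable."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def kolmogorov_distance_signs(n):
    """sup_x |P[X_n <= x] - P[g <= x]| for X_n = n^{-1/2} * (sum of n fair +-1 signs)."""
    cdf = 0.0
    worst = 0.0
    for k in range(n + 1):                 # k = number of +1 signs
        x = (2 * k - n) / math.sqrt(n)     # atom of X_n
        phi = gaussian_cdf(x)
        worst = max(worst, abs(cdf - phi))  # left limit of the step function at x
        cdf += math.comb(n, k) / 2.0 ** n
        worst = max(worst, abs(cdf - phi))  # value of the step function at x
    return worst

distances = {n: kolmogorov_distance_signs(n) for n in (25, 100, 400)}
scaled = {n: math.sqrt(n) * d for n, d in distances.items()}
```

Because the distribution function of $X_n$ is a step function, the supremum is attained at an atom (from the left or at the atom itself), which is why only those points are checked. The rescaled values in `scaled` stay bounded as $n$ grows, in line with Theorem 3.1.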



4. Proof of Theorem A 

Before presenting the proof of Theorem A, we introduce the following notation. Given 
a probability measure $\nu$ and $(Z_i)_{i=1}^{n}$ selected independently according to $\nu$, we set 
$P_n = n^{-1} \sum_{i=1}^{n} \delta_{Z_i}$, the empirical measure supported on $(Z_i)_{i=1}^{n}$. We let $P$ denote the 
expectation $\mathbb{E}_\nu$. We assume that $T \le 1$ and recall that $n$ is an odd integer. 

Let $Y = 0$ and define $X$ by $\mathbb{P}[X = 1] = 1/2 - n^{-1/2}$ and $\mathbb{P}[X = -1] = 1/2 + n^{-1/2}$. Let 
$f_1 = \mathbb{1}_{[0,1]}$ and $f_2 = \mathbb{1}_{[-1,0)}$ and consider the dictionary $F = \{f_1, f_2\}$. It is easy to verify 
that the best function in $F$ (the oracle) with respect to the quadratic risk is $f_1$, and that 
the excess loss function of $f_2$, $\mathcal{L}_2 = f_2^2 - f_1^2 = f_2 - f_1$, satisfies that 

$$\mathcal{L}_2(X) = -X, \quad \mathbb{E} \mathcal{L}_2(X) = 2 n^{-1/2} \quad \text{and} \quad \sigma^2 = \mathbb{E}(\mathcal{L}_2(X) - \mathbb{E} \mathcal{L}_2(X))^2 = 1 - 4/n.$$

To simplify notation, set $P \mathcal{L}_2 = \mathbb{E} \mathcal{L}_2(X)$ and $P_n \mathcal{L}_2 = n^{-1} \sum_{i=1}^{n} \mathcal{L}_2(X_i)$. 



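An important parameter that lies at the heart of this counterexample is the Bernstein 
constant (which is very bad in this case), 

$$\alpha = \frac{\mathbb{E}(f_1 - f_2)^2}{P \mathcal{L}_2} = \frac{\sqrt{n}}{2}.$$

Straightforward computation shows that AEW on $F$ with temperature $T$ is given by 

$$\tilde f^{AEW} = \theta_1 f_1 + (1 - \theta_1) f_2, \quad \theta_1 = \frac{1}{1 + \exp(-(n/T) P_n \mathcal{L}_2)}, \qquad (4.1)$$

and that for $h(\theta) = \theta + \alpha \theta (1 - \theta)$ defined for all $\theta \in [0, 1]$, 

$$\mathbb{E}[R(\tilde f^{AEW}) - R(f_1)] = \mathbb{E}[1 - \theta_1 - \alpha \theta_1 (1 - \theta_1)] P \mathcal{L}_2 = \mathbb{E}[1 - h(\theta_1)] P \mathcal{L}_2$$

$$= \Big[ 1 - \int_0^1 h'(t) \mathbb{P}[\theta_1 \ge t] \, dt \Big] P \mathcal{L}_2 = \Big[ 1 + \int_0^1 (2 \alpha t - (1 + \alpha)) \mathbb{P}[\theta_1 \ge t] \, dt \Big] P \mathcal{L}_2 \qquad (4.2)$$

$$= \Big[ 1 + \int_0^1 (2 \alpha t - (1 + \alpha)) \mathbb{P}[P_n \mathcal{L}_2 \ge \gamma(t)] \, dt \Big] P \mathcal{L}_2,$$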
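where $\gamma(t)$ is an increasing function defined for any $t \in (0, 1)$ by 

$$\gamma(t) = \frac{T}{n} \log\Big( \frac{t}{1 - t} \Big).$$

In particular, 

$$\mathbb{E}[R(\tilde f^{AEW}) - R(f_1)] = [I_1 + I_2] P \mathcal{L}_2$$

for 

$$I_1 = \int_0^{\alpha^{-1}} (2 \alpha t - (1 + \alpha)) \mathbb{P}[P_n \mathcal{L}_2 \ge \gamma(t)] \, dt + 1$$

and 

$$I_2 = \int_{\alpha^{-1}}^{1} (2 \alpha t - (1 + \alpha)) \mathbb{P}[P_n \mathcal{L}_2 \ge \gamma(t)] \, dt.$$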



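First, we bound $I_1$ from below. To that end, we note the following facts. First, for 
every $0 \le t \le \alpha^{-1}$, $1 + \alpha - 2 \alpha t \ge 0$ and 

$$\int_0^{\alpha^{-1}} (2 \alpha t - (1 + \alpha)) \, dt = -1.$$

Second, if we set $E = \exp(n P \mathcal{L}_2 / T)$, then for $T \le \sqrt{n} / \log n$, $0 \le (1 + E)^{-1} \le \alpha^{-1}$. In 
particular, this holds under our assumption that $T \le 1$. Moreover, because $\gamma$ is increasing, 
for $(1 + E)^{-1} \le t \le \alpha^{-1}$, $\gamma(t) \ge \gamma((1 + E)^{-1}) = -P \mathcal{L}_2$. Therefore, 

$$I_1 = \int_0^{\alpha^{-1}} (2 \alpha t - (1 + \alpha)) \mathbb{P}[P_n \mathcal{L}_2 \ge \gamma(t)] \, dt + 1 = \int_0^{\alpha^{-1}} (2 \alpha t - (1 + \alpha)) (\mathbb{P}[P_n \mathcal{L}_2 \ge \gamma(t)] - 1) \, dt$$

$$\ge \int_{(1+E)^{-1}}^{\alpha^{-1}} (1 + \alpha - 2 \alpha t) \mathbb{P}[P_n \mathcal{L}_2 < \gamma(t)] \, dt$$

$$\ge \int_{(1+E)^{-1}}^{\alpha^{-1}} (1 + \alpha - 2 \alpha t) \, dt \cdot \mathbb{P}[(\sqrt{n}/\sigma)(P_n \mathcal{L}_2 - P \mathcal{L}_2) < (\sqrt{n}/\sigma)(-2 P \mathcal{L}_2)]$$

$$\ge \int_{(1+E)^{-1}}^{\alpha^{-1}} (1 + \alpha - 2 \alpha t) \, dt \, (\mathbb{P}[g \le -8] - A/\sqrt{n}) \ge c_0 > 0,$$

where in the last step we used the Berry-Esseen theorem, with $|\mathcal{L}_2| \le 1$ and $n \ge 8 \vee 
(2A / \mathbb{P}[g \le -8])^2$, implying that $0 < c_0 < 1/2$. 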
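We turn to a lower bound for $I_2$. Applying a change of variables $t \mapsto 1 + \alpha^{-1} - u$ in 
the second term of $I_2$, it is evident that 

$$I_2 = \int_{\alpha^{-1}}^{(\alpha+1)/(2\alpha)} (2 \alpha t - (1 + \alpha)) \mathbb{P}[P_n \mathcal{L}_2 \ge \gamma(t)] \, dt + \int_{(\alpha+1)/(2\alpha)}^{1} (2 \alpha t - (1 + \alpha)) \mathbb{P}[P_n \mathcal{L}_2 \ge \gamma(t)] \, dt$$

$$= \int_{\alpha^{-1}}^{(\alpha+1)/(2\alpha)} (2 \alpha t - (1 + \alpha)) \mathbb{P}[\gamma(t) \le P_n \mathcal{L}_2 < \gamma(1 + \alpha^{-1} - t)] \, dt = I_3 + I_4$$

for 

$$I_3 = \int_{\alpha^{-1}}^{(1 + c_0/4) \alpha^{-1}} (2 \alpha t - (1 + \alpha)) \mathbb{P}[\gamma(t) \le P_n \mathcal{L}_2 < \gamma(1 + \alpha^{-1} - t)] \, dt$$

and 

$$I_4 = \int_{(1 + c_0/4) \alpha^{-1}}^{(\alpha+1)/(2\alpha)} (2 \alpha t - (1 + \alpha)) \mathbb{P}[\gamma(t) \le P_n \mathcal{L}_2 < \gamma(1 + \alpha^{-1} - t)] \, dt.$$

To estimate $I_3$, note that $2 \alpha t - (1 + \alpha) \le 0$ for $t \in [\alpha^{-1}, (\alpha + 1)/(2\alpha)]$, and thus 

$$I_3 \ge \int_{\alpha^{-1}}^{(1 + c_0/4) \alpha^{-1}} (2 \alpha t - (1 + \alpha)) \, dt \ge -\frac{c_0}{4} \Big( 1 + \frac{1}{\alpha} \Big) \ge -\frac{c_0}{3}$$

for our choice of $\alpha$. 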

The final step of the proof is to bound $I_4$ and in particular to show that for small 
values of $T$, $I_4 \ge -c_0/3$. 

For any $0 < t \le (\alpha + 1)/(2\alpha)$, consider the intervals $I_T(t) = [n \gamma(t), n \gamma(1 + \alpha^{-1} - t))$, and 
set $N_T(t) = |I_T(t) \cap \mathbb{Z}|$, which is the number of integers in $I_T(t)$. Because $\mathcal{L}_2(X) = -X$, 

$$\mathbb{P}[\gamma(t) \le P_n \mathcal{L}_2 < \gamma(1 + \alpha^{-1} - t)] = \mathbb{P}\Big[ \sum_{i=1}^{n} -X_i \in I_T(t) \Big] =: P_T(t).$$

Recall that $X \in \{-1, 1\}$, and thus $\mathbb{P}[\sum_i -X_i \in I_T(t)] = \mathbb{P}[\sum_i -X_i \in I_T(t) \cap \mathbb{Z}]$. Because 
$n \gamma(t)$ is increasing and non-negative for $t \ge 1/2$, then if $1/2 \le t \le (\alpha + 1)/(2\alpha)$, it follows 
that $0 \le n \gamma(t) \le n \gamma(1 + 1/\alpha - t) < 1$, provided that $T \le 1$. Thus, for such values of $t$, 
$N_T(t) = 0$, implying that $P_T(t) = 0$. On the other hand, if $t < 1/2$, then $\{0\} \subset I_T(t) \cap 
\mathbb{Z}$. In particular, if $N_T(t) = 1$, then $I_T(t) \cap \mathbb{Z} = \{0\}$, and since $n$ is odd, then $P_T(t) = 
\mathbb{P}[\sum_{i=1}^{n} X_i = 0] = 0$. Otherwise, $N_T(t) \ge 2$, which implies that $N_T(t) \le 2 \Lambda_T(t)$, where 
$\Lambda_T(t)$ is the length of $I_T(t)$, given by 

$$\Lambda_T(t) = n(\gamma(1 + \alpha^{-1} - t) - \gamma(t)) = T \log\Big( \frac{(1 - t)(\alpha + 1 - \alpha t)}{t(\alpha t - 1)} \Big).$$
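Therefore, for every $t$ in our range,

\[
P_T(t) \le N_T(t)\max_{k \in I_T(t) \cap \mathbb{Z}} \mathbb{P}\Big[\sum_{i=1}^n -X_i = k\Big] \le 2\Lambda_T(t)\max_{k \in \mathbb{Z}} \mathbb{P}\Big[\sum_{i=1}^n X_i = k\Big].
\]

Since $2at - (1 + a) \le 0$ for every $0 < t \le (a + 1)/(2a)$, it is evident that

\[
I_4 \ge 2T\max_{k \in \mathbb{Z}} \mathbb{P}\Big[\sum_{i=1}^n X_i = k\Big] \int_{(1+c_0/4)a^{-1}}^{(a+1)/(2a)} (2at - (1+a))\log\Big(\frac{(1-t)(a+1-at)}{t(at-1)}\Big)\,dt.
\]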



It can be shown that $\max_{k \in \mathbb{Z}} \mathbb{P}[\sum_{i=1}^n X_i = k]$ is on the order of $n^{-1/2}$ either by a direct
computation or by the Berry-Esseen theorem. Moreover, for any $(1 + c_0/4)a^{-1} \le t \le
(a + 1)/(2a)$, one has $at - 1 \ge c_0(4 + c_0)^{-1}at$, and thus,

\[
\log\Big(\frac{(1-t)(a+1-at)}{t(at-1)}\Big) \le \log\Big(\frac{2(4+c_0)}{c_0 t^2}\Big).
\]

Therefore, combining the two observations with a change of variables $u = Ct$ for $C =
(c_0/(2(4 + c_0)))^{1/2}$, it is evident that there are absolute constants $c_1, c_2$ for which

\[
I_4 \ge \frac{c_1 T}{\sqrt{n}} \int_{C(1+c_0/4)a^{-1}}^{C(a+1)/(2a)} (1 + a - 2au/C)(\log u)\,du \ge -c_2\frac{T}{\sqrt{n}}.
\]
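The $n^{-1/2}$ order of the largest atom can be checked numerically. The following is a minimal Python sketch under a simplifying assumption that is not made in the text, namely that the $X_i$ are symmetric $\pm 1$ signs; the function name `max_atom` is illustrative only. In that case the largest atom of $\sum_i X_i$ is a central binomial probability, and $\sqrt{n}$ times it should approach $\sqrt{2/\pi}$.

```python
from math import comb, pi, sqrt


def max_atom(n: int) -> float:
    """Largest point mass of S_n = X_1 + ... + X_n for i.i.d. symmetric +-1 signs.

    S_n = 2*B - n with B ~ Binomial(n, 1/2), so the largest atom is the
    central binomial probability C(n, floor(n/2)) / 2^n.
    """
    return comb(n, n // 2) / 2 ** n


# Compare sqrt(n) * max_k P[S_n = k] with the limiting constant sqrt(2/pi).
for n in (11, 101, 1001):
    print(n, sqrt(n) * max_atom(n), sqrt(2 / pi))
```

The biased signs used in the proof change the constant but not the order, which is all that the bound on $I_4$ requires.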



Thus, there is an absolute constant $c_3$ such that if $T \le c_3$, then $I_4 \ge -c_0/3$, implying
that



[R(f AiLW )-R(h)]> 



and proving the first part of Theorem A. 

To prove the second part of the theorem, note that by the Berry-Esseen theorem, for
every $x \in \mathbb{R}$, with probability greater than $\mathbb{P}[g \le x] - 2A/\sqrt{n}$,



(P n C 2 - PC 2 ) < x. 



Thus, if $n$ is large enough to ensure that $\mathbb{P}[g \le -4] - 2A/\sqrt{n} \ge \mathbb{P}[g \le -4]/2 = c_4$, and
taking $x = -4$, then with probability at least $c_4$, $P_n\mathcal{L}_2 \le -n^{-1/2}$. In that case, $\theta_1 \le
\exp(-\sqrt{n}/T)$, which yields that

\[
R(f^{\mathrm{AEW}}) - R(f_1) = (1 - \theta_1 - a\theta_1(1 - \theta_1)) \cdot P\mathcal{L}_2 \ge P\mathcal{L}_2/4 = n^{-1/2}/2,
\]

provided that $T \le \sqrt{n}/\log n$.
The first step in the proof of Theorem B involves a general statement regarding a monotone
rearrangement of independent random variables that are close to being Gaussian.
Let $W$ be a mean 0, variance 1 random variable that is absolutely continuous with respect
to the Lebesgue measure. Further assume that $|W|$ has a finite third moment (in fact, the
random variables in which we are interested are bounded) and set $\beta(W) = A\,\mathbb{E}|W|^3$, where
$A$ is the constant appearing in the Berry-Esseen theorem (Theorem 3.1). Let $W_1, \dots, W_n$
be independent random variables distributed as $W$ and set $X = n^{-1/2}\sum_{i=1}^n W_i$. Let
$(X_j)_{j=1}^{\ell}$ be $\ell$ independent copies of $X$, and put $\gamma_1 = \gamma_1(\ell) \in \mathbb{R}$ to satisfy that

\[
\mathbb{P}\Big[\min_{1 \le j \le \ell} X_j \le \gamma_1(\ell)\Big] = 1 - \frac{1}{n}.
\]

Note that such a $\gamma_1$ exists because $W$ has a density with respect to the Lebesgue measure.
Throughout the proof of Theorem B, we require the following simple estimates on $\gamma_1$.

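Lemma 5.1. There exist absolute constants $c_0, \dots, c_3$ for which the following hold:
1. If $\ell \ge c_0\log n$, then

\[
1 - \frac{\log n}{\ell} \le \mathbb{P}[X > \gamma_1] \le 1 - c_1\frac{\log n}{\ell}.
\]

2. If $\ell$ and $n$ are such that $(\beta(W)/\sqrt{n} + (\log n)/\ell) < \mathbb{P}[g \le -2]$, then $\gamma_1 \le -2$.
3. If $\gamma_1 \le -2$ and $c_0\log n \le \ell \le c_2\beta^{-1}(W)\sqrt{n}\log n$, then

\[
|\gamma_1| \sim \log^{1/2}\Big(\frac{c_3\ell}{\log n}\Big) \quad\text{and}\quad \exp(-\gamma_1^2/2) \sim \frac{\log n}{\ell}\log^{1/2}\Big(\frac{c_3\ell}{\log n}\Big).
\]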


Before we present the proof of Lemma 5.1, recall that for every $x \ge 2$,

\[
\frac{3}{4\sqrt{2\pi}}\,\frac{\exp(-x^2/2)}{x} \le \mathbb{P}[g \ge x] \le \frac{1}{\sqrt{2\pi}}\,\frac{\exp(-x^2/2)}{x}. \tag{5.1}
\]
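The two-sided Gaussian tail estimate (5.1) is easy to check numerically. The sketch below is an illustration only: it evaluates $\mathbb{P}[g \ge x]$ through the complementary error function from the Python standard library and compares it with the two bounds; the helper names `gaussian_tail` and `tail_bounds` are not from the text.

```python
from math import erfc, exp, pi, sqrt


def gaussian_tail(x: float) -> float:
    """P[g >= x] for a standard Gaussian g, via the complementary error function."""
    return 0.5 * erfc(x / sqrt(2.0))


def tail_bounds(x: float) -> tuple[float, float]:
    """Lower and upper bounds of (5.1); the lower bound is only claimed for x >= 2."""
    phi_over_x = exp(-x * x / 2.0) / (sqrt(2.0 * pi) * x)
    return 0.75 * phi_over_x, phi_over_x


# Print (x, lower bound, exact tail, upper bound) for a few values of x.
for x in (2.0, 3.0, 5.0, 8.0):
    lower, upper = tail_bounds(x)
    print(x, lower, gaussian_tail(x), upper)
```

The factor 3/4 comes from the classical lower bound $(1/x - 1/x^3)\varphi(x)$ together with $1 - 1/x^2 \ge 3/4$ for $x \ge 2$, where $\varphi$ is the standard Gaussian density.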



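Proof of Lemma 5.1. To prove the first part, note that by independence and because
$\exp(-x) \ge 1 - x$,

\[
\mathbb{P}[X > \gamma_1] = \Big(\mathbb{P}\Big[\min_{1 \le j \le \ell} X_j > \gamma_1\Big]\Big)^{1/\ell} = \exp\Big(-\frac{\log n}{\ell}\Big) \ge 1 - \frac{\log n}{\ell}. \tag{5.2}
\]

The reverse inequality follows in an identical fashion, because $\exp(-x) \le 1 - x/3$ if $0 \le
x \le 1$.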
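Since $\mathbb{P}[\min_{1 \le j \le \ell} X_j > \gamma_1] = 1/n$ forces $\mathbb{P}[X > \gamma_1] = n^{-1/\ell}$, the first part of Lemma 5.1 reduces to two elementary inequalities for the exponential. The following Python sketch checks them for a few pairs $(n, \ell)$ with $(\log n)/\ell \le 1$; the function name `survival_level` is illustrative and the chosen pairs are arbitrary.

```python
from math import log


def survival_level(n: int, ell: int) -> float:
    """P[X > gamma_1] = n**(-1/ell), forced by P[min_j X_j > gamma_1] = 1/n."""
    return n ** (-1.0 / ell)


for n, ell in ((100, 10), (1000, 7), (10 ** 6, 50)):
    x = log(n) / ell  # the first part needs 0 <= x <= 1, i.e. ell >= log n
    s = survival_level(n, ell)
    # Sandwich from the proof: 1 - x <= exp(-x) <= 1 - x/3 on [0, 1].
    print(n, ell, 1 - x, s, 1 - x / 3)
```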

Turning to the second part, if $\gamma_1 > -2$, then

\[
1 - \frac{1}{n} = \mathbb{P}\Big[\min_{1 \le j \le \ell} X_j \le \gamma_1\Big] \ge \mathbb{P}\Big[\min_{1 \le j \le \ell} X_j \le -2\Big] = 1 - (\mathbb{P}[X > -2])^{\ell},
\]

implying that $\mathbb{P}[X \le -2] \le (\log n)/\ell$. On the other hand, by the Berry-Esseen theorem,
$\mathbb{P}[X \le -2] \ge \mathbb{P}[g \le -2] - \beta(W)/\sqrt{n}$, which is impossible under the assumptions of (2).

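Finally, to prove (3), we use the Berry-Esseen theorem combined with the lower and
upper estimates on the Gaussian tail (5.1) and (5.2). Thus,

\[
\frac{3}{4\sqrt{2\pi}|\gamma_1|}\exp\Big(-\frac{|\gamma_1|^2}{2}\Big) \le \mathbb{P}[g \le \gamma_1] \le \mathbb{P}[X \le \gamma_1] + \frac{\beta(W)}{\sqrt{n}} \le \frac{\beta(W)}{\sqrt{n}} + \frac{\log n}{\ell}
\]

and

\[
\frac{1}{\sqrt{2\pi}|\gamma_1|}\exp\Big(-\frac{|\gamma_1|^2}{2}\Big) \ge \mathbb{P}[g \le \gamma_1] \ge \mathbb{P}[X \le \gamma_1] - \frac{\beta(W)}{\sqrt{n}} \ge c_1\frac{\log n}{\ell} - \frac{\beta(W)}{\sqrt{n}},
\]

from which both parts of the third claim follow. □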

Proposition 5.2. There exist constants $c_1, c_2, c_3$, and $c_4$ that depend only on $\|W\|_{\psi_2}$
for which the following holds. Let $2M^2\exp(-c_1 n^{1/3}) \le \delta \le 1$, and assume that $\mathbb{E}W^3 = 0$
and that $\gamma_1 = \gamma_1(M - 1) \le -2$. Then

\[
\mathbb{P}[\exists j \in \{2, \dots, M\}: X_j \le \gamma_1 \text{ and for every } k \in \{2, \dots, M\} \setminus \{j\},\ X_k - X_j \ge \delta]
\ge 1 - \frac{1}{n} - c_2\Big(\frac{1}{\sqrt{n}} + \delta\Big)(\log n)\sqrt{\log M},
\]

provided that $c_3\log n \le M \le c_4\sqrt{n}\,(\log n)$.

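Proof. For every $2 \le j \le M$, let

\[
\Omega_j = \{X_j \le \gamma_1 \text{ and } X_k - X_j \ge \delta \text{ for every } k \in \{2, \dots, M\} \setminus \{j\}\}.
\]

The events $\Omega_j$ for $2 \le j \le M$ are disjoint, and thus

\[
\mathbb{P}[\exists j \in \{2, \dots, M\}: X_j \le \gamma_1 \text{ and } X_k - X_j \ge \delta \text{ for every } k \in \{2, \dots, M\} \setminus \{j\}]
= \mathbb{P}\Big[\bigcup_{j=2}^M \Omega_j\Big] = (M - 1)\mathbb{P}[\Omega_2].
\]

Since the variables $(X_j)_{j=2}^M$ are independent, we have

\[
\mathbb{P}[\Omega_2] = \int_{-\infty}^{\gamma_1} f_X(z)\Big(\int_{z+\delta}^{\infty} f_X(t)\,d\mu(t)\Big)^{M-2} d\mu(z),
\]

where $f_X$ is a density function of $X$ with respect to the Lebesgue measure $\mu$.

On the other hand, for any $z \le \gamma_1$, $\mathbb{P}[X > z] > 0$ because of (5.2). Thus, for every
$z \le \gamma_1$,

\[
\int_{z+\delta}^{\infty} f_X(t)\,d\mu(t) = \Big(1 - \frac{\int_z^{z+\delta} f_X(t)\,d\mu(t)}{\int_z^{\infty} f_X(t)\,d\mu(t)}\Big) \cdot \int_z^{\infty} f_X(t)\,d\mu(t). \tag{5.3}
\]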
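Note that for every $0 \le x \le 1$, $(1 - x)^{M-2} \ge 1 - (M - 2)x$, and applied to (5.3),

\[
\begin{aligned}
\mathbb{P}[\Omega_2] &\ge \int_{-\infty}^{\gamma_1} f_X(z)\Big(\int_z^{\infty} f_X(t)\,d\mu(t)\Big)^{M-2} d\mu(z) \\
&\quad - (M - 2)\int_{-\infty}^{\gamma_1} f_X(z)\Big(\int_z^{\infty} f_X(t)\,d\mu(t)\Big)^{M-3}\Big(\int_z^{z+\delta} f_X(t)\,d\mu(t)\Big) d\mu(z) \\
&\ge \mathbb{P}[X_2 \le \gamma_1 \text{ and } X_k \ge X_2 \text{ for every } k \ge 3] - T_2 \\
&= \frac{1}{M - 1}\,\mathbb{P}\Big[\min_{2 \le j \le M} X_j \le \gamma_1\Big] - T_2,
\end{aligned}
\]

where

\[
T_2 = (M - 2)\int_{-\infty}^{\gamma_1} f_X(z)\Big(\int_z^{z+\delta} f_X(t)\,d\mu(t)\Big) d\mu(z).
\]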



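Recall that if $(W_i)$ are independent mean-0 random variables and $(a_i)$ are real numbers,
then $\|\sum a_i W_i\|_{\psi_2} \le c(\sum a_i^2\|W_i\|_{\psi_2}^2)^{1/2}$, where $c$ is an absolute constant [29]. Thus,
$\|X\|_{\psi_2} \le c\|W\|_{\psi_2}$, and for any $t < 0$,

\[
\int_{-\infty}^{t} f_X(z)\Big(\int_z^{z+\delta} f_X(u)\,d\mu(u)\Big) d\mu(z) \le \mathbb{P}[X \le t] \le 2\exp(-t^2/c^2\|W\|_{\psi_2}^2).
\]

Let $t_0 < 0$ be such that

\[
2\exp(-t_0^2/c^2\|W\|_{\psi_2}^2) = \frac{\delta\sqrt{\log(M-1)}}{(M-1)(M-2)}.
\]

Thus,

\[
(M - 2)\int_{-\infty}^{t_0} f_X(z)\Big(\int_z^{z+\delta} f_X(t)\,d\mu(t)\Big) d\mu(z) \le \frac{\delta\sqrt{\log(M-1)}}{M - 1}.
\]

Note that if $t_0 \ge \gamma_1$, then our claim follows. Indeed, because $\mathbb{P}[\min_{2 \le j \le M} X_j \le \gamma_1] =
1 - 1/n$, we have

\[
\mathbb{P}[\Omega_2] \ge \frac{1}{M - 1}\Big(1 - \frac{1}{n}\Big) - \frac{\delta\sqrt{\log(M-1)}}{M - 1}.
\]

Otherwise, we split the interval $(-\infty, \gamma_1] = (-\infty, t_0) \cup [t_0, \gamma_1]$, and to upper bound $T_2$, it
remains to control the integral on the second interval $[t_0, \gamma_1]$.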

Recall that $W \in L_{\psi_2}$ and that $\mathbb{E}W^3 = 0$. Therefore, by Proposition 3.2, it is evident
that if $z$ and $\delta$ satisfy that $z \le z + \delta \le 0$ and $|z|, |z + \delta| \le B_0 n^{1/6}$, then

\[
\int_z^{z+\delta} f_X(t)\,d\mu(t) = \mathbb{P}[z \le X \le z + \delta] \le \mathbb{P}[z \le g \le z + \delta] + \frac{B_1}{\sqrt{n}}\exp(-z^2/2), \tag{5.4}
\]

where $B_0$ and $B_1$ are constants that depend only on $\|W\|_{\psi_2}$. In addition, for every $z \le 0$,



B [z<g<z + S] < -=exp(-z J /2) / exp(-zt)dt< -=cxp(-z 2 /2). (5.5) 

V27T Jo V27t 



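If $2M^2\exp(-B_0^2 n^{1/3}/\|W\|_{\psi_2}^2) \le \delta \le 1$, then $|t_0| \le B_0 n^{1/6}$. Combining (5.4) and (5.5)
with the definition of $T_2$, we have

\[
\begin{aligned}
(M - 2)\int_{t_0}^{\gamma_1} f_X(z)\Big(\int_z^{z+\delta} f_X(t)\,d\mu(t)\Big) d\mu(z)
&\le (M - 2)\Big(\frac{B_1}{\sqrt{n}} + \frac{\delta}{\sqrt{2\pi}}\Big)\int_{t_0}^{\gamma_1} f_X(z)\exp(-z^2/2)\,d\mu(z) \\
&\le (M - 2)\Big(\frac{B_1}{\sqrt{n}} + \frac{\delta}{\sqrt{2\pi}}\Big)\exp(-\gamma_1^2/2)\,\mathbb{P}[X \le \gamma_1] \\
&\le (M - 2)\Big(\frac{B_1}{\sqrt{n}} + \frac{\delta}{\sqrt{2\pi}}\Big)\exp(-\gamma_1^2/2)\,\frac{\log n}{M - 1},
\end{aligned}
\]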

where the last inequality follows from (5.2). By Lemma 5.1, and since M < -^/nlogn, 
(M-2) /" /x(t)<W<)) <M*) 

S <^ + «)(^)« to «-'^ 
for some constant c = c(/3), from which our claim follows. □ 

We next describe the construction needed for the proof of Theorem B. Let $(X, Y)$ and
$F = \{f_1, \dots, f_M\}$ be defined by

\[
Y = 0, \qquad f_1(X) = (12)^{1/4}U_1, \qquad f_j(X) = (12)^{1/4}(U_j + \lambda) \text{ for every } 2 \le j \le M,
\]

where $U_1, \dots, U_M$ are $M$ independent random variables with density $u \mapsto 2(u +
\lambda)\mathbf{1}_{[-\lambda, 1-\lambda]}(u)$ for $0 < \lambda < 1/2$ to be fixed later. Note that for this choice of density
function, $(U_1 + \lambda)^2$ is uniformly distributed on $[0, 1]$, and the best element in $F$ with
respect to the quadratic risk is $f_1$.
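The construction can be simulated directly. The Python sketch below is an illustration under stated assumptions: it samples $U$ by inverse-CDF sampling (since $V = U + \lambda$ has density $2v$ on $[0, 1]$, one may take $V$ to be the square root of a uniform variable), and it compares Monte Carlo estimates of the quadratic risks with the closed forms $R(f_1) = \sqrt{12}\,(1/2 - 4\lambda/3 + \lambda^2)$ and $R(f_j) = \sqrt{12}/2$ for $j \ge 2$, which follow from integrating against the density $2v$. The value $\lambda = 0.1$, the sample size, and the function names are arbitrary choices made for this sketch.

```python
import random
from math import sqrt


def sample_u(lam: float, rng: random.Random) -> float:
    """Draw U with density 2(u + lam) on [-lam, 1 - lam] by inverse-CDF sampling.

    V = U + lam has density 2v on [0, 1], so V = sqrt(Uniform[0, 1]).
    """
    return sqrt(rng.random()) - lam


def risk_f1(lam: float) -> float:
    """Exact R(f_1) = sqrt(12) * E[U^2] = sqrt(12) * (1/2 - 4*lam/3 + lam^2)."""
    return sqrt(12.0) * (0.5 - 4.0 * lam / 3.0 + lam * lam)


def risk_fj() -> float:
    """Exact R(f_j) = sqrt(12) * E[(U + lam)^2] = sqrt(12) / 2 for j >= 2."""
    return sqrt(12.0) / 2.0


rng = random.Random(0)
lam = 0.1
draws = [sample_u(lam, rng) for _ in range(200_000)]
mc_f1 = sqrt(12.0) * sum(u * u for u in draws) / len(draws)
mc_fj = sqrt(12.0) * sum((u + lam) ** 2 for u in draws) / len(draws)
print("R(f_1): Monte Carlo", mc_f1, "exact", risk_f1(lam))
print("R(f_j): Monte Carlo", mc_fj, "exact", risk_fj())
```

The gap $R(f_j) - R(f_1) = \sqrt{12}\,(4\lambda/3 - \lambda^2)$ is positive and of order $\lambda$ for $0 < \lambda < 1/2$, which is the sense in which $f_1$ is the best element of the dictionary.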

Let $(U_j^{(i)}: j = 1, \dots, M,\ i = 1, \dots, n)$ be a family of independent random variables
distributed as $U_1$. Thus, for every $1 \le i \le n$, $f_j(X_i) = (12)^{1/4}(U_j^{(i)} + \lambda)$ for every $2 \le j \le M$
and $f_1(X_i) = (12)^{1/4}U_1^{(i)}$. For every $1 \le j \le M$, set

\[
R_j = \frac{\sqrt{12}}{\sqrt{n}}\sum_{i=1}^n \Big((U_j^{(i)} + \lambda)^2 - \mathbb{E}(U_j^{(i)} + \lambda)^2\Big),
\]

and observe that if $W = \sqrt{12}((U + \lambda)^2 - \mathbb{E}(U + \lambda)^2)$, then $W$ is a mean 0, variance 1
random variable that is absolutely continuous with respect to the Lebesgue measure,
$W \in L_{\psi_2}$, and $\mathbb{E}W^3 = 0$. These properties allow us to apply Proposition 5.2
to the random variables $R_1, \dots, R_M$.

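Let $0 < p < 1$ (to be named later), and set

\[
\Xi(R_1) = R_1 + \frac{T}{\sqrt{n}}\log\Big(\frac{p}{2(1 - p)}\Big) - \sqrt{12}\,\lambda(2 - \lambda)\sqrt{n}
\]

and

\[
\delta = \frac{-T}{\sqrt{n}}\log\Big(\frac{p}{2(M - 2)(1 - p)}\Big).
\]

Consider the system of inequalities

\[
R_j \le \Xi(R_1), \qquad R_k - R_j \ge \delta \text{ for every } k \ne 1, j, \tag{$C_j$}
\]

and recall that for each $j = 1, \dots, M$, $\theta_j$ denotes the weight of $f_j$ in the AEW procedure.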

Proposition 5.3. There exist absolute constants $c_1$ and $c_2$ for which the following holds.
Let $0 < p < 1/2$ and $2 \le j \le M$. If the system $(C_j)$ is satisfied, then

\[
\theta_j \ge 1 - p.
\]

Moreover, if $p \le c_1\lambda$, then the quadratic risk of the function produced by the AEW
procedure satisfies

\[
R(f^{\mathrm{AEW}}) \ge \min_{f \in F} R(f) + c_2\lambda.
\]

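Proof. Let $2 \le j \le M$, and assume that $(C_j)$ is satisfied. Recall that $R_n(f)$ is the
empirical risk of $f$, and note that for any $k \in \{2, \dots, M\} \setminus \{j\}$,

\[
R_n(f_k) - R_n(f_j) = \frac{1}{n}\sum_{i=1}^n [f_k(X_i)^2 - f_j(X_i)^2] = \frac{R_k - R_j}{\sqrt{n}} \ge \frac{\delta}{\sqrt{n}} = \frac{-T}{n}\log\Big(\frac{p}{2(M - 2)(1 - p)}\Big). \tag{5.6}
\]

In addition, since $U_1^{(i)} \le 1 - \lambda$ almost surely for any $1 \le i \le n$,

\[
R_n(f_1) - R_n(f_j) = \frac{1}{n}\sum_{i=1}^n [f_1(X_i)^2 - f_j(X_i)^2] \ge \frac{R_1 - R_j}{\sqrt{n}} - \sqrt{12}\,\lambda(2 - \lambda) \ge \frac{-T}{n}\log\Big(\frac{p}{2(1 - p)}\Big). \tag{5.7}
\]

Combining (5.6) and (5.7), it is evident that

\[
\theta_j = \frac{1}{\sum_{k=1}^M \exp[(-n/T)(R_n(f_k) - R_n(f_j))]} \ge \frac{1}{1 + (M - 2)p/(2(M - 2)(1 - p)) + p/(2(1 - p))} = 1 - p.
\]

Since the functions $f_1, \dots, f_M$ are independent in $L_2(X)$ and $\mathbb{E}f_j \ge 0$,

\[
R(f^{\mathrm{AEW}}) = \mathbb{E}\Big(\sum_{j=1}^M \theta_j f_j\Big)^2 \ge \theta_j^2\,\mathbb{E}f_j^2,
\]

and there is an absolute constant $c_0$ for which $\mathbb{E}f_j^2 \ge \mathbb{E}f_1^2 + c_0\lambda$. Thus,

\[
\theta_j^2\,\mathbb{E}f_j^2 - \mathbb{E}f_1^2 \ge (1 - p)^2(\mathbb{E}f_1^2 + c_0\lambda) - \mathbb{E}f_1^2 \ge c_2\lambda,
\]

provided that $p \le c_1\lambda$, giving

\[
R(f^{\mathrm{AEW}}) \ge \mathbb{E}f_1^2 + c_2\lambda = \min_{f \in F} R(f) + c_2\lambda,
\]

as claimed. □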
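The weight bound in Proposition 5.3 can be illustrated numerically. The following Python sketch computes exponential weights $\theta_j \propto \exp(-(n/T)R_n(f_j))$ and evaluates them in the extremal configuration in which the empirical risk gaps sit exactly on the thresholds appearing in (5.6) and (5.7); in that case the favoured weight should equal $1 - p$ up to rounding. The numerical values of $M$, $n$, $T$, $p$ and the function name `aew_weights` are assumptions made for this sketch only.

```python
from math import exp, log


def aew_weights(emp_risks: list[float], n: int, temperature: float) -> list[float]:
    """Exponential weights theta_j proportional to exp(-(n/T) * R_n(f_j))."""
    scores = [-(n / temperature) * r for r in emp_risks]
    shift = max(scores)  # subtract the maximum for numerical stability
    unnorm = [exp(s - shift) for s in scores]
    total = sum(unnorm)
    return [u / total for u in unnorm]


# Extremal configuration: R_n(f_j) = 0 for the favoured index j (position 1 below),
# and the other empirical risks sit exactly on the thresholds of (5.6) and (5.7).
M, n, T, p = 6, 100, 0.5, 0.1
gap_1 = -(T / n) * log(p / (2 * (1 - p)))            # R_n(f_1) - R_n(f_j)
gap_k = -(T / n) * log(p / (2 * (M - 2) * (1 - p)))  # R_n(f_k) - R_n(f_j), k != 1, j
risks = [gap_1, 0.0] + [gap_k] * (M - 2)
theta = aew_weights(risks, n, T)
print("favoured weight:", theta[1], "target 1 - p:", 1 - p)
```

Any configuration with larger gaps only increases the favoured weight, which is why the system $(C_j)$ is sufficient for $\theta_j \ge 1 - p$.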

Next, we formulate a general statement, from which Theorem B follows immediately.

Theorem 5.4. There exist absolute constants $c_i$, $i = 0, \dots, 5$ and an integer $n_0$ for which
the following holds. For any $n \ge n_0$, $1 \le \kappa \le c_0\sqrt{n/\log n}$, $0 < T \le 1$, and $c_1 T/\sqrt{n\log n} \le
\varepsilon \le 1/8$, let $M = \lceil c_2\sqrt{n\log n}\rceil$, $\lambda = c_3\varepsilon\sqrt{(\log n)/n}$, and $p = n^{-\varepsilon\kappa/T}$. Set $F$ to be the class
of functions defined above with those parameters. Then, with probability at least

\[
1 - c_4(\varepsilon\kappa + T + 1)\Big(\frac{\log^3 n}{n}\Big)^{(1-2\varepsilon)^2/2},
\]

there exists $j \ge 2$ such that

\[
\theta_j \ge 1 - p.
\]

In particular, with the same probability and if $0 < T \le \min\{1, 2\varepsilon\kappa\}$,

\[
R(f^{\mathrm{AEW}}) \ge \min_{f \in F} R(f) + c_5\varepsilon\sqrt{\frac{\log n}{n}}.
\]

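Proof. Set

\[
P_0 = \mathbb{P}[\exists j \in \{2, \dots, M\} \text{ such that } \theta_j \ge 1 - p],
\]

and, by Proposition 5.3,

\[
P_0 \ge \mathbb{P}[\exists j \in \{2, \dots, M\} \text{ for which } (C_j) \text{ is satisfied}] = P_1.
\]

Let $\gamma_1 = \gamma_1(M - 1)$ be defined by $\mathbb{P}[\min_{2 \le j \le M} R_j \le \gamma_1] = 1 - n^{-1}$, and observe that
$\gamma_1$ is well defined and satisfies all three parts of Lemma 5.1 for $\ell = M - 1$. Set

\[
\Omega_0 = \{\Xi(R_1) \ge \gamma_1\},
\]

\[
A = \{\exists j \in \{2, \dots, M\}: R_j \le \Xi(R_1) \text{ and } R_k - R_j \ge \delta \text{ for every } k \ne 1, j\}
\]

and

\[
B = \{\exists j \in \{2, \dots, M\}: R_j \le \gamma_1 \text{ and } R_k - R_j \ge \delta \text{ for every } k \ne 1, j\}.
\]

Since the random variables $R_j$, $j = 1, \dots, M$ are independent, we have

\[
P_1 \ge \mathbb{E}_{R_1}[\mathbb{P}[A \mid R_1]\mathbf{1}_{\Omega_0}] \ge \mathbb{P}[B]\,\mathbb{P}[\Omega_0].
\]

Applying Proposition 5.2, we then have

\[
\mathbb{P}[B] \ge 1 - \frac{1}{n} - c_2\Big(\frac{1}{\sqrt{n}} + \delta\Big)(\log n)\sqrt{\log M},
\]

provided that $c_3\log n \le M \le c_4\sqrt{n}\,(\log n)$.
To lower bound $\mathbb{P}[\Omega_0]$, note that

\[
\mathbb{P}[\Omega_0] = \mathbb{P}\Big[R_1 \ge \gamma_1 - \frac{T}{\sqrt{n}}\log\Big(\frac{p}{2(1 - p)}\Big) + \sqrt{12}\,\lambda(2 - \lambda)\sqrt{n}\Big].
\]

Fix $0 < \varepsilon < 1/8$ and assume that $\lambda$, $p$ and $T$ are such that

\[
\sqrt{12}\,\lambda(2 - \lambda)\sqrt{n} \le -\varepsilon\gamma_1 \quad\text{and}\quad -\frac{T}{\sqrt{n}}\log\Big(\frac{p}{2(1 - p)}\Big) \le -\varepsilon\gamma_1. \tag{5.8}
\]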
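By the Berry-Esseen theorem and (5.1),

\[
\begin{aligned}
\mathbb{P}[\Omega_0] &\ge \mathbb{P}[R_1 \ge (1 - 2\varepsilon)\gamma_1] = 1 - \mathbb{P}[R_1 < (1 - 2\varepsilon)\gamma_1] \\
&\ge 1 - \mathbb{P}[g \le (1 - 2\varepsilon)\gamma_1] - \frac{\beta(W)}{\sqrt{n}} \\
&\ge 1 - \frac{1}{\sqrt{2\pi}(1 - 2\varepsilon)|\gamma_1|}\exp(-(1 - 2\varepsilon)^2\gamma_1^2/2) - \frac{\beta(W)}{\sqrt{n}},
\end{aligned}
\]

and by Lemma 5.1,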

cxp(-(l-2 £ ) 2 7 2 /2)< C5 

Therefore, 



lo S"- , 1/2 f C 5M 
-log ' 1 



17 - 1 



\logn 



(1-26 



*o > 1 c 2 

n 



+ <5 (log n)Vlog M) • 1-CB 



, 3 \ (l-2e) : 

log n x 
M 



provided that c 2 log n < M < C3 ^/nlogn. 

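To complete the proof, we need to choose $\lambda$ and $p$ for which (5.8) holds. By Lemma 5.1,

\[
|\gamma_1| \gtrsim \log^{1/2}\Big(\frac{M}{\log n}\Big),
\]

and thus (5.8) holds for $\lambda$ and $p$ for which

\[
\lambda \le c_8\varepsilon\,\frac{\log^{1/2}(M/\log n)}{n^{1/2}} \quad\text{and}\quad p \ge 2\exp\Big(-c_9\varepsilon\,\frac{\sqrt{n}}{T}\log^{1/2}\Big(\frac{M}{\log n}\Big)\Big).
\]

In particular, when we take $M \sim \sqrt{n\log n}$, $\lambda \sim \varepsilon((\log M)/n)^{1/2}$, and $p = n^{-\varepsilon\kappa/T}$, $p$
satisfies the required condition as long as $\varepsilon \ge T/\sqrt{n\log n}$ and $\kappa \le \sqrt{n/\log n}$, as assumed.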
Moreover, 



logn 



implying that

\[
P_0 \ge 1 - c_6(\varepsilon\kappa + T + 1)\Big(\frac{\log^3 n}{n}\Big)^{(1-2\varepsilon)^2/2}.
\]

The lower bound on the risk of the AEW procedure now follows from Proposition 5.3. □
6. Proof of Theorem C

In this section we prove Theorem C, which we reformulate below. From here on, we
assume that the dictionary $F$ is finite, consisting of $M$ functions, and that the functions
are indexed according to their risk in an increasing order. Thus, $f_1 = f_F^*$. In addition, we
denote $\mathcal{L}_f(\cdot) = Q(\cdot, f) - Q(\cdot, f_1)$, and thus $R(f) - R(f_1) = \mathbb{E}\mathcal{L}_f$.
For every $r > 0$, recall that

\[
\psi(r) = \log(|\{f \in F: \mathbb{E}\mathcal{L}_f \le r\}| + 1) + \sum_{j=1}^{\infty} 2^{-j}\log(|\{f \in F: 2^{j-1}r < \mathbb{E}\mathcal{L}_f \le 2^j r\}| + 1),
\]

which serves as a measure of complexity for the class $F$.
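For a finite dictionary the complexity $\psi(r)$ is a finite sum, because shells beyond the largest excess risk are empty and contribute $\log 1 = 0$. The following Python sketch computes it from a list of excess risks $\mathbb{E}\mathcal{L}_f$; the function name `psi` and the sample values are illustrative assumptions, not taken from the text.

```python
from math import log


def psi(excess_risks: list[float], r: float) -> float:
    """Complexity psi(r) of a finite dictionary, given the excess risks E L_f >= 0."""
    if r <= 0:
        raise ValueError("r must be positive")
    # j = 0 term: functions with excess risk at most r.
    total = log(sum(1 for e in excess_risks if e <= r) + 1)
    top = max(excess_risks, default=0.0)
    j = 1
    # Shells (2^(j-1) r, 2^j r] beyond the largest excess risk are empty
    # and would only add log(1) = 0, so the loop can stop there.
    while 2 ** (j - 1) * r < top:
        count = sum(1 for e in excess_risks if 2 ** (j - 1) * r < e <= 2 ** j * r)
        total += 2.0 ** (-j) * log(count + 1)
        j += 1
    return total


excess = [0.0, 0.1, 0.3, 0.5, 1.5]
print(psi(excess, 0.25))
```

Because the weights $2^{-j}$ sum to at most 2 over $j \ge 0$, one always has $\psi(r) \le 2\log(M + 1)$, and $\psi(r)$ is much smaller when few functions are almost minimizers at scale $r$.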

The first component needed in the proof of Theorem C is the level $\lambda(x)$ with the
following property: with probability at least $1 - 2\exp(-x)$, $R_n(f_j) - R_n(f_1)$ is equivalent
to $R(f_j) - R(f_1)$ if $R(f_j) - R(f_1) \ge \lambda(x)$. This "isomorphism" constant was introduced
by [5]. To formulate the exact properties that we need, first recall the following definitions
and notation.

If $G = \mathcal{L}_F$ is the excess loss functions class $\{\mathcal{L}_f: f \in F\}$, then let $\mathrm{star}(G, 0) = \{\theta g: 0 \le
\theta \le 1,\ g \in G\}$ be the star-shaped hull of $G$ and 0. Set $G_r = \mathrm{star}(G, 0) \cap \{g: \mathbb{E}g = r\}$, that
is, the set of functions in the star-shaped hull of $\mathcal{L}_F$ and 0, with expectation $r$. Let

\[
r^* = \inf\Big\{r: \mathbb{E}\sup_{g \in G_r} |P_n g - Pg| \le r/2\Big\},
\]

where, as always, $P_n$ denotes the empirical mean and $P$ is the mean according to the
underlying probability measure of $Z$.

Theorem 6.1 ([5]). There exists an absolute constant $c$ for which the following holds.
Let $F$ be a class of functions bounded by $b$, such that $\mathcal{L}_F$ is a $(1, B)$-Bernstein class. For
every $x > 0$ and an integer $n$, let

\[
\lambda(x) = c\max\Big\{r^*, (b + B)\frac{x}{n}\Big\}. \tag{6.1}
\]

Then, with probability at least $1 - 2\exp(-x)$, for every $f \in F$ with $R(f) - R(f_F^*) \ge \lambda(x)$,

\[
R_n(f) - R_n(f_F^*) \ge \frac{1}{2}(R(f) - R(f_F^*)).
\]

Let $\rho = \kappa_1(B + b)/n$, where $\kappa_1$ is an absolute constant to be named later. Recall that
functions in $F$ are indexed according to their risk in an increasing order. Let $J_-(x) =
\{j: R(f_j) - R(f_1) < \lambda(x)\}$, and set $J_+(x)$ as its complement. Define the sets $J_{+,0} = \{j \in
J_+(x): R(f_j) - R(f_1) \le \rho\}$ and, for $k \ge 1$,

\[
J_{+,k} = \{j \in J_+(x): 2^{k-1}\rho < R(f_j) - R(f_1) \le 2^k\rho\}.
\]

(Note that some of the sets $J_{+,k}$ may be empty.) Set

\[
k_0 = \sup\{k \ge 0: 2^k \le \log(|J_{+,k}| + 1)\},
\]

and let $I = J_- \cup \bigcup_{k \le k_0} J_{+,k}$.

From Theorem 6.1, it follows that for every $k \ge 0$ and every $j \in J_{+,k}$, $R_n(f_j) -
R_n(f_F^*) \ge \frac{1}{2}(R(f_j) - R(f_F^*))$. This is because $R(f_j) - R(f_F^*) \ge \lambda(x)$ by the definition
of $J_+(x)$, and $J_+(x) \supset J_{+,k}$.
The key factor in the proof of Theorem C is Theorem 6.2. 

Theorem 6.2. There exist absolute constants $c_1$ and $c_2$ for which the following holds.
Let $F$ be a class of functions bounded by $b$, such that $\mathcal{L}_F$ is a $(1, B)$-Bernstein class with
respect to a convex risk function $R$. Then, with probability at least $1 - 2\exp(-x)$, if $f^{\mathrm{AEW}}$
is produced by the AEW algorithm and $T \le c_1(b + B)$, then

\[
R(f^{\mathrm{AEW}}) \le R(f_F^*) + \lambda(x) + c_2 2^{k_0}\frac{b + B}{n},
\]

where $\lambda(x)$ is as defined in (6.1).

Proof. Let $(\theta_j)_{j=1}^M$ be the weights of the AEW algorithm, and set $f^{\mathrm{AEW}} = \sum_{j=1}^M \theta_j f_j$
to be the aggregate function. Because $R$ is a convex function,

\[
R(f^{\mathrm{AEW}}) - R(f_1) \le \sum_{j=1}^M \theta_j(R(f_j) - R(f_1)).
\]

Note that for every $j \in I$, $R(f_j) - R(f_1) \le \lambda(x) + 2^{k_0}\rho = \lambda(x) + \kappa_1 2^{k_0}(b + B)/n$. In
particular, because $\sum_{j=1}^M \theta_j = 1$,

\[
\sum_{j \in I} \theta_j(R(f_j) - R(f_1)) \le \lambda(x) + \kappa_1 2^{k_0}(b + B)/n.
\]

On the other hand, with probability at least $1 - 2\exp(-x)$, for every $k > k_0$ and every
$j \in J_{+,k}$,

\[
R_n(f_j) - R_n(f_1) \ge (R(f_j) - R(f_1))/2. \tag{6.2}
\]

Applying the definition of the weights in the AEW algorithm and given that $\theta_1 \le 1$,

\[
\sum_{j \notin I} \theta_j(R(f_j) - R(f_1)) = \sum_{k > k_0}\sum_{j \in J_{+,k}} \theta_j(R(f_j) - R(f_1))
\le \sum_{k > k_0}\sum_{j \in J_{+,k}} \exp\Big(-\frac{n}{T}(R_n(f_j) - R_n(f_1))\Big)(R(f_j) - R(f_1)) = (*).
\]

From the definition of $k_0$, it is evident that for every $k > k_0$, $2^k > \log|J_{+,k}|$, and thus if
$T \le c_1\max\{b, B\}$ and $\kappa_1$ is sufficiently large, then

\[
(*) \le \sum_{k > k_0} \exp\Big(\log|J_{+,k}| - \frac{n}{2T}2^{k-1}\rho\Big)2^k\rho \le \sum_{k > k_0} \exp\Big(-c_2\frac{n}{T}2^k\rho\Big)2^k\rho \le c_3\frac{T}{n}.
\]

Indeed, this follows because for that choice of $T$, $(n/T)2^{k_0}\rho \ge c_4$, with $c_4$ an absolute
constant.

Thus, with probability at least $1 - 2\exp(-x)$,

\[
R(f^{\mathrm{AEW}}) - R(f_1) \le \lambda(x) + \kappa_1 2^{k_0}(b + B)/n + c_3\frac{T}{n} \le \lambda(x) + c_5 2^{k_0}\frac{b + B}{n},
\]

as claimed. □



The next step in the proof of Theorem C requires several simple facts regarding the
empirical process indexed by a localization of the star-shaped hull of a Bernstein class.
First, it is simple to verify that the star-shaped hull of a $(1, B)$-Bernstein class is a
$(1, B)$-Bernstein class as well. Second, if $G = \mathrm{star}(\mathcal{L}_F, 0)$ and $G_r = \{h \in G: \mathbb{E}h = r\}$, then

j>l f J j>l 

In particular, 



E sup 

heG T 



1 ™ 

- ^ - Eft, 



<VE sup 
~1 heH T , : 



1 n 

-^2h(Zi)-Eh 



Lemma 6.3. There exists an absolute constant $c$ for which the following holds. If $\mathcal{L}_F$ is
a $(1, B)$-Bernstein class with respect to $Z$, then for every $r$ and $j \ge 1$,

\[
\mathbb{E}\sup_{h \in H_{r,j}} |P_n h - Ph| \le c\max\Big\{b2^{-j}\frac{\log(|H_{r,j}| + 1)}{n},\ \sqrt{\frac{\log(|H_{r,j}| + 1)}{n}}\sqrt{rB2^{-j}}\Big\}.
\]

Proof. Fix $r > 0$ and $j \ge 1$, and let

\[
D = \sup_{h \in H_{r,j}} \Big(\frac{1}{n}\sum_{i=1}^n h^2(Z_i)\Big)^{1/2}.
\]






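Note that every $h \in H_{r,j}$ satisfies that $h = r\mathcal{L}_f/\mathbb{E}\mathcal{L}_f$ for some $f \in F$, and for which
$\mathbb{E}\mathcal{L}_f \ge r2^{j-1}$. Therefore, using the Bernstein condition on $\mathcal{L}_F$,

\[
\mathbb{E}h^2 = r^2\frac{\mathbb{E}\mathcal{L}_f^2}{(\mathbb{E}\mathcal{L}_f)^2} \le rB2^{-j+1}.
\]

Moreover, $\|h\|_\infty \le (r/\mathbb{E}\mathcal{L}_f)\|\mathcal{L}_f\|_\infty \le b2^{-j+1}$. Thus, by the Gine-Zinn symmetrization
theorem and a contraction argument (see, e.g., [12] and [19]),

\[
\begin{aligned}
\mathbb{E}D^2 &\le \mathbb{E}\sup_{h \in H_{r,j}} \Big|\frac{1}{n}\sum_{i=1}^n h^2(Z_i) - \mathbb{E}h^2\Big| + rB2^{-j+1} \\
&\le \frac{2}{n}\,\mathbb{E}_Z\mathbb{E}_\varepsilon \sup_{h \in H_{r,j}} \Big|\sum_{i=1}^n \varepsilon_i h^2(Z_i)\Big| + rB2^{-j+1} \\
&\le \frac{c\,b2^{-j+2}}{n}\,\mathbb{E}_Z\mathbb{E}_\varepsilon \sup_{h \in H_{r,j}} \Big|\sum_{i=1}^n \varepsilon_i h(Z_i)\Big| + rB2^{-j+1} \\
&\le \frac{c\,b2^{-j+2}}{\sqrt{n}}\sqrt{\log(|H_{r,j}| + 1)}\,\mathbb{E}D + rB2^{-j+1},
\end{aligned}
\]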

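where the last inequality is evident by the sub-Gaussian properties of the Rademacher
process (cf. [19]). Since $\mathbb{E}D \le (\mathbb{E}D^2)^{1/2}$, it follows that

\[
\mathbb{E}D^2 \le c\,b2^{-j+2}\sqrt{\frac{\log(|H_{r,j}| + 1)}{n}}\,(\mathbb{E}D^2)^{1/2} + rB2^{-j+1},
\]

implying that

\[
\mathbb{E}D^2 \le c_1\max\Big\{b^2 2^{-2j}\frac{\log(|H_{r,j}| + 1)}{n},\ rB2^{-j}\Big\}.
\]

Thus, again using a symmetrization argument and the sub-Gaussian properties of the
Rademacher process, we have

\[
\mathbb{E}\sup_{h \in H_{r,j}} \Big|\frac{1}{n}\sum_{i=1}^n h(Z_i) - \mathbb{E}h\Big| \le \frac{c_2}{\sqrt{n}}\sqrt{\log(|H_{r,j}| + 1)}\,\mathbb{E}D
\le c_3\max\Big\{b2^{-j}\frac{\log(|H_{r,j}| + 1)}{n},\ \sqrt{\frac{\log(|H_{r,j}| + 1)}{n}}\sqrt{rB2^{-j}}\Big\}.
\]

□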



Corollary 6.4. There exist absolute constants $c_1$ and $c_2$ for which the following holds.
Let $F$ be a finite class consisting of $M$ functions bounded by $b$, such that the excess loss
class $\mathcal{L}_F$ is a $(1, B)$-Bernstein class. If we set $\theta = c_1(b + B)(\log M)/n$, then

\[
r^* \le c_2\frac{b + B}{n}\psi(\theta).
\]
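Proof. Observe that for every $r > 0$,

\[
\begin{aligned}
\mathbb{E}\sup_{h \in G_r} \Big|\frac{1}{n}\sum_{i=1}^n h(Z_i) - \mathbb{E}h\Big|
&\le \sum_{j \ge 0} \mathbb{E}\sup_{h \in H_{r,j}} \Big|\frac{1}{n}\sum_{i=1}^n h(Z_i) - \mathbb{E}h\Big| \\
&\le c_1\sum_{j \ge 0} \max\Big\{\frac{b}{n}2^{-j}\log(|H_{r,j}| + 1),\ \sqrt{\frac{rB}{n}}\,2^{-j/2}\sqrt{\log(|H_{r,j}| + 1)}\Big\} \\
&\le c_1\frac{b}{n}\Big(\log(|H_{r,0}| + 1) + \sum_{j \ge 1} 2^{-j}\log(|H_{r,j}| + 1)\Big) \\
&\quad + c_1\sqrt{\frac{rB}{n}}\Big(\sqrt{\log(|H_{r,0}| + 1)} + \sum_{j \ge 1} 2^{-j/2}\sqrt{\log(|H_{r,j}| + 1)}\Big) \\
&= u(r),
\end{aligned}
\]

where we define $H_{r,0} = \{(r\mathcal{L}_f)/(\mathbb{E}\mathcal{L}_f): f \in F,\ \mathbb{E}\mathcal{L}_f \le r\}$. Let $\bar{r} = \inf\{r: u(r) \le r/2\}$. Since
$|H_{r,j}| \le M$ for every $j \ge 0$, we have

\[
u(r) \le c_2\max\Big\{b\frac{\log M}{n},\ \sqrt{\frac{rB\log M}{n}}\Big\}
\]

and thus

\[
\bar{r} \le c_3(b + B)(\log M)/n = \theta.
\]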



Moreover, the functions of r, 



log(|ff r>0 | +1) + XV J 'l0g(|ff r j| + 1), 
3>l 



and 



'log(|ff r ,0| + 1) + E 2^' /2 v/logd^,, | + 1), 

are increasing, and thus for any r < 9, 



- ( log(|ff r , | + 1) + J2 2 ~ 3 ^E(\H r .j I + 1) 

3>l 



< - ( log(|ff fl ,o| + 1) + E 2 "' l °s(\H0,j\ + 1) 
j>i 
\o g (\H rfi \ + 1) + ^2-^ 2 v /iog(|tf rj | + 1; 



< 



Thus, if we consider 



r = c 3 - 



log(|F e , | + 1) + J2^~ j/ y^g(\He, j \ + 1) 

3>l 



log(\H 0fi \ + 1) +^2- J 'log(|H ej | + r 

3>1 



c 3 -( ^/log(|^.o| + 1) + ^2^/ 2 v /log(| J ff ( , J | +1) 



< C4 



6 + S 



V>(#) 



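for appropriate constants $c_3$ and $c_4$, then $r \le \theta$. Thus, $u(r) \le r/2$ and, therefore,

\[
\bar{r} \le c_4\frac{b + B}{n}\psi(\theta).
\]

Finally, because

\[
\mathbb{E}\sup_{h \in G_r} |P_n h - Ph| \le u(r)
\]

and $r^* = \inf\{r: \mathbb{E}\sup_{g \in G_r} |P_n g - Pg| \le r/2\}$, we have $r^* \le \bar{r}$.

□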



Proof of Theorem C. The proof of Theorem C follows from estimates of $\lambda(x)$ and $2^{k_0}$.
From Corollary 6.4, it is evident that

\[
\lambda(x) \le c_1\max\Big\{\frac{b + B}{n}\psi\Big(c_1(b + B)\frac{\log M}{n}\Big),\ (b + B)\frac{x}{n}\Big\},
\]

where $c_1$ is an absolute constant to be identified later. (Note that $\psi$ is an increasing
function.)

Next, by the definition of $k_0$, $2^{k_0} \le \log M$. Therefore, using the notation of Theorem 6.2,

\[
\bigcup_{k \le k_0} \{f_j: j \in J_{+,k}\} \subset \Big\{f_j: R(f_j) - R(f_1) \le \kappa_1(b + B)\frac{\log M}{n}\Big\}
\]

and, in particular

\[
2^{k_0} \le \log\Big(\Big|\bigcup_{k \le k_0} \{f_j: j \in J_{+,k}\}\Big| + 1\Big)
\le \log\Big(\Big|\Big\{f_j: R(f_j) - R(f_1) \le \kappa_1(b + B)\frac{\log M}{n}\Big\}\Big| + 1\Big) \le \log(|H_{\theta,0}| + 1),
\]

for an appropriate choice of constant $c_1$.

The second part of Theorem C follows from a standard integration argument.

□